To keep this document as tidy as possible, I haven’t included the code for some of the sections.

setwd("/Users/deyapple/Documents/Courses/Term04/DA/Homework/hw_11")
library(tidyverse)
library(plotly)
library(highcharter)
library(ggmap)
library(leaflet)
library(kableExtra)
library(ggthemes)
library(sp)
library(shiny)

disaster <- read_delim("data/disaster.txt", "\t", escape_double = F, trim_ws = T)
world <- map_data("world")
worldwide <- read_csv("data/worldwide.csv") ## world earthquake data

1 3D Chart of Latitude, Longitude and Depth


To distinguish the points as much as possible, I’ve based both the color and the size of each point on the magnitude of the earthquake.
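The chart code is omitted; a minimal plotly sketch along these lines could produce it (the latitude, longitude, depth and mag column names match the worldwide data used later in this document; the exact marker and axis settings are my own):

```r
library(plotly)

# 3D scatter of epicenters: depth on the z axis, color and size by magnitude.
plot_ly(worldwide,
        x = ~longitude, y = ~latitude, z = ~depth,
        color = ~mag, size = ~mag,
        type = "scatter3d", mode = "markers") %>%
  layout(scene = list(zaxis = list(title = "Depth (km)")))
```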









2 Tsunami Throughout The Years
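The code and chart for this section are omitted; a minimal sketch could count tsunami-generating earthquakes per year. The `FLAG_TSUNAMI` column name is an assumption on my part (it appears in the NOAA significant-earthquake export this data resembles):

```r
library(tidyverse)

# Yearly count of earthquakes that generated a tsunami.
# FLAG_TSUNAMI is assumed to be non-NA exactly when a tsunami occurred.
disaster %>%
  filter(!is.na(FLAG_TSUNAMI)) %>%
  count(YEAR) %>%
  ggplot(aes(YEAR, n)) +
  geom_line() +
  labs(y = "Number of tsunamis") +
  theme_minimal()
```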









3 Iran 2D Earthquake Density
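The code for this map is omitted; a hedged sketch could build the density layer like this (assuming the disaster data has COUNTRY, LATITUDE and LONGITUDE columns, as in the NOAA export; the styling choices are my own):

```r
library(tidyverse)

# 2D density of Iranian epicenters over the country outline.
disaster %>%
  filter(COUNTRY == "IRAN") %>%
  drop_na(LATITUDE, LONGITUDE) %>%
  ggplot(aes(LONGITUDE, LATITUDE)) +
  borders("world", regions = "Iran") +
  stat_density_2d(aes(fill = ..level..), geom = "polygon", alpha = 0.5) +
  coord_quickmap() +
  theme_minimal()
```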









4 Probability of a High-Intensity Earthquake in Iran


We want to calculate this using conditional probability. In other words, we want to compute the following formula:
\[ P(B | A) = \frac{P(B \cap A)}{P(A)}\]

But first we need to find out what events exactly \(A\) and \(B\) are.
First we look at the most recent earthquake in Iran with an intensity higher than 7.


country year month day hour minute second intensity
IRAN 2017 11 12 18 18 17.0 7.3


We are now in the year 2018, so our condition is that there have been no earthquakes with an intensity higher than 7 in the past year.
In conclusion, we want to calculate the probability that there will be a high-intensity earthquake in the next five years, given that there has been none in the past year. So we can define the events \(A\) and \(B\) as follows:


\[\begin{cases} A & \quad \text{No high-intensity earthquake occurs in the year following a high-intensity earthquake.}\\ B & \quad \text{A high-intensity earthquake occurs within the five years following a high-intensity earthquake.} \end{cases} \]


In order to calculate the formula above, all we need to do is calculate the number of times these events occur. That is \(|A|\) and \(|A \cap B|\).


A <- iequake.high %>% 
  arrange(year, month, day, hour, minute, second) %>% 
  mutate(dif = lead(year) - year) %>%  # years until the next high-intensity quake
  filter(dif > 1)                      # event A: none in the following year
AandB <- A %>% 
  filter(dif <= 5)                     # A and B: the next one arrives within five years


Probability:

nrow(AandB) / nrow(A)
[1] 0.173913





5 Map of Total and Average Number of Deaths by Earthquake
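The map code is omitted; a minimal leaflet sketch of the total-deaths layer might look like the following (column names follow the NOAA export; the marker scaling is my own choice, and the average-deaths layer would be a second set of markers on a grouped summary):

```r
library(tidyverse)
library(leaflet)

# Circle markers at each epicenter, sized by the death toll.
disaster %>%
  drop_na(LATITUDE, LONGITUDE, DEATHS) %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = ~LONGITUDE, lat = ~LATITUDE,
                   radius = ~sqrt(DEATHS) / 10,
                   popup = ~paste("Deaths:", DEATHS))
```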






6 Predicting Number of Deaths

As we can see, this model performs poorly: it seems we cannot predict the number of deaths from the latitude, longitude, intensity and depth of an earthquake. We would need to find another model, but since the question doesn’t ask for one, we’ll leave it here.
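The model-fitting code isn’t shown; judging from the summary output below (gaussian family, alpha = 0.5, four predictors, training frame death.stat), the call was something along these lines. The predictor names lat, long, int and dep come from the coefficient table, but the response name `deaths` is my assumption:

```r
library(h2o)
h2o.init()

# Elastic-net gaussian GLM predicting the death toll from location, intensity
# and depth. `death.stat` and the predictor names come from the output below;
# the response column name `deaths` is an assumption.
death.h2o <- as.h2o(death.stat)
hglm <- h2o.glm(x = c("lat", "long", "int", "dep"), y = "deaths",
                training_frame = death.h2o,
                family = "gaussian", alpha = 0.5)
```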


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /var/folders/mk/jrl557gd0qz741jsx86f3gjr0000gn/T//RtmpCCa7go/h2o_deyapple_started_from_r.out
    /var/folders/mk/jrl557gd0qz741jsx86f3gjr0000gn/T//RtmpCCa7go/h2o_deyapple_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 minutes 2 seconds 
    H2O cluster timezone:       Asia/Tehran 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.20.0.2 
    H2O cluster version age:    1 month and 5 days  
    H2O cluster name:           H2O_started_from_R_deyapple_vcd248 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.78 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.4.3 (2017-11-30) 

summary(hglm)
Model Details:
==============

H2ORegressionModel: glm
Model Key:  GLM_model_R_1532123019999_1 
GLM Model: summary
    family     link                              regularization
1 gaussian identity Elastic Net (alpha = 0.5, lambda = 4.4018 )
  number_of_predictors_total number_of_active_predictors
1                          4                           4
  number_of_iterations training_frame
1                    1     death.stat

H2ORegressionMetrics: glm
** Reported on training data. **

MSE:  241569675
RMSE:  15542.51
MAE:  3785.566
RMSLE:  NaN
Mean Residual Deviance :  241569675
R^2 :  0.01385919
Null Deviance :289793226199
Null D.o.F. :1182
Residual Deviance :285776925575
Residual D.o.F. :1178
AIC :26204.27





Scoring History: 
            timestamp   duration iterations negative_log_likelihood
1 2018-07-21 02:14:44  0.000 sec          0      289793226199.46198
        objective
1 244964688.24976

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Standardized Coefficient Magnitudes: standardized coefficient magnitudes
  names coefficients sign
1   int   722.634812  POS
2   lat   308.955574  POS
3   dep   188.673056  NEG
4  long    31.152321  POS





7 Predicting Magnitude of Main Earthquake Based on Pre-earthquakes

For this analysis, a pre-earthquake is one that occurs at the same place, earlier on the same day, as the actual earthquake. From this it follows that days on which only one earthquake occurred at a given place contain no pre-earthquake data, so we delete those days from the dataframe.
The earthquake of the highest magnitude is considered the main earthquake, and all other earthquakes occurring before it at the same place on the same date are considered pre-earthquakes.
Given all this, we construct the dataframe as follows:

The days involving a pre-earthquake:

quake.with.pre <- worldwide %>% 
  mutate(date = as.Date(time)) %>% 
  mutate(time = as.POSIXct(time)) %>%
  dplyr::select(place, date, pre_mag = mag, pre_time = time) %>% 
  group_by(place, date) %>% filter(n() != 1)

The main earthquake of those days:

main.quake <- quake.with.pre %>% 
  group_by(place, date) %>% slice(which.max(pre_mag)) %>% plyr::rename(c("pre_mag" = "main_mag", 
                                                                     "pre_time" = "main_time"))

The model predicting the magnitude of the main earthquake based on the pre-earthquake

pre.main <- quake.with.pre %>% full_join(main.quake, by = c("place", "date")) %>% 
  dplyr::filter(main_time > pre_time)

pm.model <- lm(data = pre.main, formula = main_mag ~ pre_mag)
summary(pm.model)

Call:
lm(formula = main_mag ~ pre_mag, data = pre.main)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6447 -0.3497 -0.1393  0.2287  3.7432 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.91570    0.05072   18.05   <2e-16 ***
pre_mag      0.89644    0.01467   61.10   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4879 on 1828 degrees of freedom
Multiple R-squared:  0.6713,    Adjusted R-squared:  0.6711 
F-statistic:  3733 on 1 and 1828 DF,  p-value: < 2.2e-16


The R-squared value is relatively high so our model seems to have performed well.






8 Magnitude Vs. Depth

map.depth <- worldwide %>% filter(type == "earthquake") %>% dplyr::select(depth, mag)
cor.test(map.depth$mag, map.depth$depth)

    Pearson's product-moment correlation

data:  map.depth$mag and map.depth$depth
t = 49.087, df = 60112, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1886153 0.2039871
sample estimates:
      cor 
0.1963132 

Based on the low \(p\)-value we can conclude that the magnitude and the depth of an earthquake are correlated, though with a coefficient of about 0.20 the correlation is weak.






9 HAARP Theory

The HAARP theory, loosely interpreted, states that over the last few years earthquakes have become more frequent and higher in magnitude.
The worldwide dataframe has no column indicating the country, so I’ll use the following lines of code to determine the country corresponding to each latitude/longitude pair:

library(sp)
library(rworldmap)
coords <- worldwide %>% select(longitude, latitude)
sp.countries <- getMap(resolution='low')
sp.points = SpatialPoints(coords, proj4string = CRS(proj4string(sp.countries)))  
info = over(sp.points, sp.countries)
worldwide$country <- info$ADMIN


I’ve also taken a sample of the data to show the countries calculated:

country place latitude longitude
Afghanistan 37km S of Jarm, Afghanistan 36.5261 70.8507
United States of America 82km E of Cantwell, Alaska 63.5055 -147.3217
Argentina 7km WSW of Yacuiba, Bolivia -22.0513 -63.7515
South Africa 14km ESE of Fochville, South Africa -26.5152 27.6312
United States of America 5km NW of The Geysers, CA 38.8100 -122.7972
United States of America 33km NW of Fairview, Oklahoma 36.4496 -98.7792
Bolivia 27km W of Chimore, Bolivia -17.0303 -65.3886
United States of America 143km SSE of Tok, Alaska 62.2060 -141.6305
Canada 79km WNW of Skagway, Alaska 59.7654 -136.5846
United States of America Wyoming 43.6615 -105.2913


Now I’ve drawn an animation to get a feeling for the theory. Since the number of countries was a whopping 117, I’ve sampled only 20 of them.
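The animation code is omitted; a minimal plotly sketch could look like this, assuming the per-country, per-year summary frame `country.info.all` (with `country`, `year` and `count` columns) used in the ANOVA below:

```r
library(plotly)

# Animated bar chart of yearly earthquake counts for 20 sampled countries.
set.seed(1)
sampled <- dplyr::filter(country.info.all,
                         country %in% sample(unique(country), 20))
plot_ly(sampled, x = ~country, y = ~count, frame = ~year, type = "bar")
```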












Now, in order to actually test the Haarp Theory using statistical models, we perform a One Way ANOVA test.


aov(data = country.info.all, count ~ year) %>% summary.aov()
             Df   Sum Sq Mean Sq F value Pr(>F)
year          3   280143   93381   0.658  0.579
Residuals   310 44014668  141983               
aov(data = country.info.all, avg_mag ~ year) %>% summary.aov()
             Df Sum Sq Mean Sq F value Pr(>F)
year          3   0.06 0.01838   0.086  0.968
Residuals   310  66.29 0.21383               

The null hypothesis is that the means of all these groups are equal, which means that the average number of earthquakes per year (and, in the second test, their average magnitude) has not changed over the past few years.
Given the very high \(p\)-values we cannot reject this null hypothesis, so we do not have enough evidence to conclude whether the HAARP theory is true or not.






10 Interesting Facts About Earthquakes

10.1 Deadly Earthquakes Are Not So Deadly Anymore. Or Are They?


I read on the internet that the deadliest earthquake known hit Shaanxi, China on January 23, 1556, killing an estimated 830,000 people. Let’s see if we can confirm that from the dataset.

disaster %>%
  dplyr::select(location = LOCATION_NAME, year = YEAR, month = MONTH, day = DAY, deaths = DEATHS) %>%
  drop_na() %>%
  arrange(desc(deaths)) %>% head(1)
# A tibble: 1 x 5
  location                  year month   day deaths
  <chr>                    <int> <int> <int>  <int>
1 CHINA:  SHAANXI PROVINCE  1556     1    23 830000

Let’s see how many people in total have been killed by earthquakes.

disaster$DEATHS %>% sum(na.rm = T)
[1] 7753936

Keep in mind that this is the number of earthquakes that had NA in their DEATHS column:

disaster %>% dplyr::select(DEATHS) %>% filter(is.na(DEATHS)) %>% nrow()
[1] 4051

out of 6000 observations. So probably a lot more than 8 million people have died because of earthquakes.


Let’s see whether the technology and everything else we have nowadays, which people from a thousand years ago didn’t, have saved lives from earthquakes.


First, I’ve performed a correlation test between DEATHS and YEAR:

test.data <- disaster %>% dplyr::select(YEAR, DEATHS) %>% drop_na()
cor.test(test.data$DEATHS, test.data$YEAR)

    Pearson's product-moment correlation

data:  test.data$DEATHS and test.data$YEAR
t = -10.254, df = 1973, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.2663975 -0.1826374
sample estimates:
      cor 
-0.224933 

Since the \(p\)-value is very small, we can conclude that the correlation isn’t zero. The estimate is also negative, so we can say that the number of deaths has decreased over time.


I’ve also made two groups based on the data. The first one is before the age of Technology, and the second is after it, which began in the second half of the 20th century.

before <- test.data %>% dplyr::filter(YEAR <= 1950)
after <- test.data %>% dplyr::filter(YEAR > 1950)

Then we can draw the density of the number of deaths to see whether it’s normal or not.

ggplot(test.data) + 
  geom_density(aes(x = DEATHS)) +
  theme_minimal()

As you can see, it’s not. So we’ll use a Mann-Whitney test, a.k.a. the Wilcoxon rank-sum test, on the data.

wilcox.test(before$DEATHS, after$DEATHS, alternative = "greater")

    Wilcoxon rank sum test with continuity correction

data:  before$DEATHS and after$DEATHS
W = 767120, p-value < 2.2e-16
alternative hypothesis: true location shift is greater than 0

Since the \(p\)-value is very small, we can reject the null hypothesis and conclude that there is a statistically significant difference between the two populations. And since the alternative hypothesis is “greater”, we can conclude that the number of deaths before the technology era was significantly higher than after it.






10.2 Where Does All The Chaos Happen?


I’ve used the disaster dataframe.

Now I’ve drawn them on the world map:
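The map code isn’t shown; a sketch along these lines, using the `world` polygons loaded at the top, could reproduce it (the LATITUDE/LONGITUDE column names follow the NOAA export; the styling is my own):

```r
library(tidyverse)
library(ggthemes)

# Plot every recorded epicenter over the world polygons loaded at the top.
ggplot() +
  geom_polygon(data = world, aes(long, lat, group = group),
               fill = "grey90", color = "grey70") +
  geom_point(data = drop_na(disaster, LATITUDE, LONGITUDE),
             aes(LONGITUDE, LATITUDE),
             color = "red", alpha = 0.3, size = 0.5) +
  coord_quickmap() +
  theme_map()
```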





As we can see, there are two distinct parts of the world where most earthquakes occur: one rings the Pacific Ocean, along the western edge of the Americas and the eastern edges of Asia and Australia, and the other stretches across southern Europe and Asia.
These two belts are called the Ring of Fire (also known as the Circum-Pacific Belt) and the Alpide Belt.


I’ve also included pictures of these places so you can see the points on the map I made match.


The Ring of Fire:





The Alpide Belt:








10.3 Intensity Is Not A Synonym For Magnitude


The following paragraph is based on this webpage:

The magnitude of an earthquake is a measured value of the earthquake size. The magnitude is the same no matter where you are, or how strong or weak the shaking was in various locations. The intensity of an earthquake is a measure of the shaking created by the earthquake, and this value does vary with location.

So I’ve grouped the disaster dataset based on the magnitude of earthquakes and arranged them based on the number of earthquakes with the same magnitude:
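The grouping code isn’t shown; it was presumably something like the following sketch:

```r
library(tidyverse)

# Count earthquakes per magnitude and keep the ten most common values.
disaster %>%
  drop_na(MAGNITUDE) %>%
  group_by(MAGNITUDE) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  head(10)
```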


# A tibble: 10 x 2
   MAGNITUDE count
       <dbl> <int>
 1       6     124
 2       5.5   113
 3       6.5    92
 4       7.5    92
 5       7      85
 6       6.8    82
 7       6.3    69
 8       6.7    68
 9       7.3    61
10       5.8    60

So, as we can see, magnitude 6.0 is the most common, with 124 earthquakes (70 of which have both intensity and death counts recorded). Below are the location, intensity and number of deaths caused by a sample of the earthquakes with a magnitude of 6.0.


# A tibble: 10 x 3
   LOCATION_NAME                                        INTENSITY DEATHS
   <chr>                                                    <int>  <int>
 1 YEMEN:  DHAMAR                                               8   2800
 2 TURKEY:  USAK                                                8     21
 3 TURKEY:  VARTO, MUS                                          8     10
 4 CHINA:  GANSU PROVINCE:  HUIHUIPU                            8     36
 5 CHINA:  YUNNAN PROVINCE                                      7     50
 6 CHINA:  YUNNAN PROVINCE:   SONGMING, YILIANG                 8    614
 7 GUATEMALA:  S PALIN, SAN VICENTE PACAYA                      6      5
 8 CHINA:  GANSU PROVINCE:  XIHE                                8     30
 9 AFGHANISTAN:  HINDU KUSH:  CHITRAL, MARDAN, MALAKAND         4     11
10 OREGON:  KLAMATH FALLS                                       7      2


I’ve calculated the average number of deaths for each intensity and drawn the plot below:
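The code for this step is omitted; a sketch along these lines could compute the averages and draw the bar chart (column names follow the NOAA export):

```r
library(tidyverse)

# Average death toll per reported intensity, restricted to magnitude-6.0 quakes.
intensity.deaths <- disaster %>%
  filter(MAGNITUDE == 6) %>%
  drop_na(INTENSITY, DEATHS) %>%
  group_by(INTENSITY) %>%
  summarise(avg_death = mean(DEATHS))

ggplot(intensity.deaths, aes(INTENSITY, avg_death)) +
  geom_col() +
  theme_minimal()
```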



As you can most definitely see, there is a vast difference between earthquakes with an intensity of 11 and the other values. Since the difference is obvious from the chart, I won’t perform a hypothesis test.


Below is the data collected for the chart above:


# A tibble: 7 x 2
  INTENSITY avg_death
      <int>     <dbl>
1         4      11  
2         6      44  
3         7      60.8
4         8     471. 
5         9      43.8
6        10     284. 
7        11   30018. 

To see the differences among the other intensity values better, I’ve also drawn this plot:
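The code for this second plot is also omitted; one possibility (an assumption on my part) is the same averages on a log scale, which makes the smaller values visible:

```r
library(tidyverse)

# Same per-intensity averages as above, on a log10 scale so the smaller
# values are no longer dwarfed by intensity 11.
disaster %>%
  filter(MAGNITUDE == 6) %>%
  drop_na(INTENSITY, DEATHS) %>%
  group_by(INTENSITY) %>%
  summarise(avg_death = mean(DEATHS)) %>%
  ggplot(aes(INTENSITY, avg_death)) +
  geom_col() +
  scale_y_log10() +
  theme_minimal()
```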


Finally, I’ve performed a correlation test on the magnitude-6.0 earthquakes (the 70 with both intensity and death counts recorded) to see whether INTENSITY is positively correlated with DEATHS.


cor.test(test.data$INTENSITY, test.data$DEATHS, alternative = "greater")

    Pearson's product-moment correlation

data:  test.data$INTENSITY and test.data$DEATHS
t = 3.0524, df = 68, p-value = 0.001619
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.1598527 1.0000000
sample estimates:
      cor 
0.3471384 

The \(p\)-value is less than 0.05, so we can reject the null hypothesis and conclude that, for a fixed magnitude, higher intensity is associated with a higher number of deaths.